{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Recursively split by character\n",
    "This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is [\"\\n\\n\", \"\\n\", \" \", \"\"]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.\n",
    "\n",
    "此文本拆分器是通用文本的推荐文本拆分器。它由字符列表参数化。它试图按顺序拆分它们，直到块足够小。默认列表为 [\"\\n\\n\", \"\\n\", \" \", \"\"] 。这样做的效果是试图将所有段落（然后是句子，然后是单词）尽可能长时间地放在一起，因为这些段落通常似乎是语义上最强的文本片段。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='meow meow🐱'),\n",
       " Document(page_content='meow meow🐱'),\n",
       " Document(page_content='meow😻😻'),\n",
       " Document(page_content='sss\\n aaa')]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "# This is a long document we can split up.\n",
    "with open(\"../data/meow.txt\") as f:\n",
    "    state_of_the_union = f.read()\n",
    "\n",
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    # Set a really small chunk size, just to show.\n",
    "    chunk_size=20,\n",
    "    chunk_overlap=5,\n",
    "    length_function=len,\n",
    "    is_separator_regex=False,\n",
    ")\n",
    "texts = text_splitter.create_documents([state_of_the_union])\n",
    "texts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Document(page_content='meow meow🐱 \\n meow meow🐱 \\n meow😻😻\\n\\n sss\\n aaa')]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_splitter = RecursiveCharacterTextSplitter(\n",
    "    separators=[\n",
    "        \"\\n\\n\",\n",
    "        \"\\n\",\n",
    "        \" \",\n",
    "        \".\",\n",
    "        \",\",\n",
    "        \"\\u200b\",  # Zero-width space\n",
    "        \"\\uff0c\",  # Fullwidth comma\n",
    "        \"\\u3001\",  # Ideographic comma\n",
    "        \"\\uff0e\",  # Fullwidth full stop\n",
    "        \"\\u3002\",  # Ideographic full stop\n",
    "        \"\",\n",
    "    ],\n",
    "    # Existing args\n",
    ")\n",
    "texts = text_splitter.create_documents([state_of_the_union])\n",
    "texts"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain0_1",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
